Classifying English Documents by National Dialect
Abstract
We investigate national dialect identification, the task of classifying English documents according to their country of origin. We use corpora of known national origin as a proxy for national dialect. To identify general (as opposed to corpus-specific) characteristics of national dialects of English, we use a variety of corpora from different sources, with inter-corpus variation in length, topic and register. The central intuition is that features that are predictive of national origin across different data sources are features that characterize a national dialect. We examine a number of classification approaches motivated by different areas of research, and evaluate the performance of each method across three national dialects: Australian, British, and Canadian English. Our results demonstrate that there are lexical and syntactic characteristics of each national dialect that are consistent across data sources.
Related Resources
Classifying and Clustering Dialects of North American English
This paper presents the results of experiments in which machine learning techniques were applied to the problem of determining regional dialect boundaries. Specifically, decision tree classification and k-means clustering were applied to a corpus of phonetic measurements taken from a large survey of North American English vowels. Pairwise classification and clustering experiments were done for...
Do Web Corpora from Top-Level Domains Represent National Varieties of English?
In this study we consider the problem of determining whether an English corpus constructed from a given national top-level domain (e.g., .uk, .ca) represents the national dialect of English of the corresponding country (e.g., British English, Canadian English). We build English corpora from two top-level domains (.uk and .ca, corresponding to the United Kingdom and Canada, respectively) that co...
Towards the Design of the Australian National Corpus
Corpora are becoming more and more important as a research tool for linguists as they are large collections of authentic text. However, not every researcher has the time and resources to compile their own corpus. Large corpora in the world such as the BNC, the ANC or the International Corpus of English (ICE) have been widely used for research on the English language in general or an English dia...
An Analysis of Ministry of Education’s Strategic Plans Based on Favorable Components of English Language Teaching Using Shannon’s Entropy
The present research aims to analyze the content of Ministry of Education’s strategic plans (the Fundamental Reform Document of Education, the Comprehensive National Scientific Plan and the National Curriculum Document) based on Shannon's entropy regarding the favorable components of teaching English. The contents of the Fundamental Reform Document of Education, the Comprehensive National Scien...
The use of shibboleth words for automatically classifying speakers by dialect
Real-world applications using speech recognition must perform well over a range of dialects. Differences in dialect between the speakers in the training database and the target users often lead to degraded recognition performance. For the BBN Hark Hidden Markov Model (HMM) based system, we have already developed a reasonably effective technique [1] for dealing with multiple US dialects. The solu...